{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Extracting partial strings\n",
    "\n",
    "It can sometimes be useful to make comparisons based on substrings or parts of column values. For example, one approach to comparing postcodes is to consider their constituent components, e.g. area, district, etc (see [Featuring Engineering](../data_preparation/feature_engineering.md) for more details).\n",
    "\n",
    "We can use functions such as substrings and regular expressions to enable users to compare strings without needing to engineer new features from source data.\n",
    "\n",
    "Splink supports this functionality via the use of the `ComparisonExpression`.\n",
    "\n",
    "## Examples\n",
    "\n",
    "### 1. Exact match on postcode area\n",
    "\n",
    "Suppose you wish to make comparisons on a postcode column in your data, however only care about finding links between people who share the same area code (given by the first 1 to 2 letters of the postcode). The regular expression to pick out the first two characters is `^[A-Z]{1,2}`:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NULLIF(regexp_extract(\"postcode_l\", '^[A-Z]{1,2}', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Z]{1,2}', 0), '')\n"
     ]
    }
   ],
   "source": [
    "import splink.comparison_level_library as cll\n",
    "from splink import ColumnExpression\n",
    "\n",
    "pc_ce = ColumnExpression(\"postcode\").regex_extract(\"^[A-Z]{1,2}\")\n",
    "print(cll.ExactMatchLevel(pc_ce).get_comparison_level(\"duckdb\").sql_condition)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We may therefore configure a comparison as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Comparison 'CustomComparison' of \"postcode\".\n",
      "Similarity is assessed using the following ComparisonLevels:\n",
      "    - 'postcode is NULL' with SQL rule: \"postcode_l\" IS NULL OR \"postcode_r\" IS NULL\n",
      "    - 'Exact match on transformed postcode' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Z]{1,2}', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Z]{1,2}', 0), '')\n",
      "    - 'All other comparisons' with SQL rule: ELSE\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from splink.comparison_library import CustomComparison\n",
    "\n",
    "cc = CustomComparison(\n",
    "    output_column_name=\"postcode\",\n",
    "    comparison_levels=[\n",
    "        cll.NullLevel(\"postcode\"),\n",
    "        cll.ExactMatchLevel(pc_ce),\n",
    "        cll.ElseLevel()\n",
    "    ]\n",
    "\n",
    ")\n",
    "print(cc.get_comparison(\"duckdb\").human_readable_description)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| person_id_l | person_id_r | postcode_l | postcode_r | comparison_level |\n",
    "|-------------|-------------|------------|------------|------------------|\n",
    "| 7           | 1           | **SE**1P 0NY   | **SE**1P 0NY   | exact match      |\n",
    "| 5           | 1           | **SE**2 4UZ    | **SE**1P 0NY   | exact match      |\n",
    "| 9           | 2           | **SW**14 7PQ   | **SW**3 9JG    | exact match      |\n",
    "| 4           | 8           | **N**7 8RL     | **EC**2R 8AH   | else level       |\n",
    "| 6           | 3           |            | **SE**2 4UZ    | null level       |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Exact match on initial\n",
    "\n",
    "In this example we use the `.substr` function to create a comparison level based on the first letter of a column value. \n",
    "\n",
    "Note that the `substr` function is 1-indexed, so the first character is given by `substr(1, 1)`:  The first two characters would be given by `substr(1, 2)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "SUBSTRING(\"first_name_l\", 1, 1) = SUBSTRING(\"first_name_r\", 1, 1)\n"
     ]
    }
   ],
   "source": [
    "import splink.comparison_level_library as cll\n",
    "from splink import ColumnExpression\n",
    "\n",
    "initial = ColumnExpression(\"first_name\").substr(1,1)\n",
    "print(cll.ExactMatchLevel(initial).get_comparison_level(\"duckdb\").sql_condition)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Additional info\n",
    "\n",
    "Regular expressions containing “\\” (the python escape character) are tricky to make work with the Spark linker due to escaping so consider using alternative syntax, for example replacing “\\d” with “[0-9]”.\n",
    "\n",
    "Different regex patterns can achieve the same result but with more or less efficiency. You might want to consider optimising your regular expressions to improve performance (see [here](https://www.loggly.com/blog/regexes-the-bad-better-best/), for example)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
